Introduction to spatial models for citizen science data

Lionel Hertzog
30.08.2021

When are spatial models not needed

Before jumping into the extra complexities of spatial models, it is worth reflecting when we do not need spatial models:

  1. spatial field not of primary interest to the study AND
  2. spatial variation is captured by covariates included in the model

When are spatial models not needed

Formally if we have X a covariate spatially-varying and data y following:

\[ y \sim D(\beta X, ...) \]

With D being any statistical distribution. Then the spatial signal in the data will be captured by the model.

plot of chunk unnamed-chunk-1

When are spatial models not needed

Suppose that we collected data with a clear spatial signal:

plot of chunk unnamed-chunk-3

str(data)
'data.frame':   100 obs. of  4 variables:
 $ x        : num  18.313 0.334 38.248 14.021 3.95 ...
 $ y        : num  19 20.2 12.8 12 24.9 ...
 $ covariate: num  0.735 0.408 0.98 0.51 0.551 ...
 $ resp     : num  5.36 1.91 7.71 3.03 3.68 ...

When are spatial models not needed

To assess whether the covariates sufficiently capture the spatial signal, one can check for spatial autocorrelation in the residuals using DHARMa:

m <- lm(resp ~ covariate, data)
dd <- simulateResiduals(m)

plot of chunk unnamed-chunk-6

When spatial models are needed

  1. Spatial field of interest (distribution models …)
  2. Covariates cannot account for the spatial signal

Spatial models and citizen science

Biodiversity recording from citizen science programms often contain a spatial signal due to citizen reporting biodiversity records in specific areas:

  1. Easily accessible (close to roads, close to [vacation] home …)
  2. Areas with interesting species / habitats

Spatial models and citizen science

In addition, citizen science biodiversity monitoring programs give informations on where a species was recorded (and is therefore assumed to be present), but an absence of records cannot be usually assumed to represent a species absence. This type of data is usually called presence-only.

One interesting class of models to fit to presence-only data with spatial signal is: point pattern analysis.

Point pattern analysis

In point pattern analysis we have some georeferenced data and potentially some covariates.

plot of chunk unnamed-chunk-7

The locations of the data depend on an underlying intensity function \( \lambda \)

Point pattern analysis will directly try to estimate the intensity function using:

\[ log(\lambda(s))) = \beta_0 + \beta_1 * covariate(s) + S(s) \]

Point pattern analysis

Two main packages to fit point patterns in R:

  1. spatstat: not covered in this workshop
  2. INLA/inlabru

INLA introduction

INLA stands for Integrated Nested Laplace Approximation, it is an approximate Bayesian approach that offers a faster alternative to other Bayesian algorithms such as MCMC or HMC.

INLA can fit a wide range of models including:

  1. Hierarchical models (random effect)
  2. Autocorrelated models (like AR(1))
  3. Spatial models (including point patterns)
  4. Space-time models

INLA and spatial models

Spatial effect in INLA are fitted using Stochastic Partial Differential Equations (SPDE)

drawing

Picture from Krainski et al.

INLA and spatial models

The continuous spatial effect is estimated by using basis functions (like smooths) on a mesh which dramatically speed up the estimation of continuous spatial effects [more on this later].

drawing

Picture from Krainski et al.

Mesh

The first step to fit spatial models in INLA is to define a mesh of the area of interest (the domain).

mesh <- inla.mesh.2d(loc.domain = domain, max.edge = 0.1)

plot of chunk unnamed-chunk-10

Mesh

The mesh plays an important role in estimating the spatial effects, the spatial effect will be estimated at the vertices / nodes of the triangles. As a result if the spatial effect to be estimated varies at smaller scale than the mesh triangles, the model will not be able to correctly estimate the spatial effect.

plot of chunk unnamed-chunk-11

Mesh

At the other extreme too precise mesh will lead to unnecessary long computing times.

plot of chunk unnamed-chunk-12

Mesh

It is usually recommended to start with coarser mesh to debug model code, explore the results and use finer meshes for final model runs.

There is also a ShinyApp built in INLA to help defining the mesh:

meshbuilder()

drawing

Defining the SPDE

Given a mesh, to fit a spatial model in INLA/inlabru we need to define the SPDE and more specifically the priors for the standard deviation (sigma) and the scale (range) of the spatial effect.

There are various ways to define the SPDE, the recommended one is via Penalized Complexity priors that is parametrized as follow:

  1. variance of the spatial effect: \( P(\sigma > \sigma_0) = s \), where \( \sigma_0 \) and s are two values to be specified
  2. range of the spatial effect: \( P(\rho < \rho_0) = p \), where \( \rho_0 \) and p are two values to be specified

Defining the SPDE

From Fulgstad et al 2019:

In the simulation study we observe good coverage of the equal-tailed 95% credible intervals when the prior satisfies \( P(\sigma > \sigma_0) = 0.05 \) and \( P(\rho < \rho_0) = 0.05 \), where \( \sigma_0 \) is between 2.5 to 40 times the true marginal standard deviation and \( \rho_0 \) is between 0.1 and 0.4 of the true range.

Defining the SPDE

In other words if we assume a standard deviation of the spatial effect to be around 2 and a range of 1 we can set the spde in R:

spde <- inla.spde2.pcmatern(mesh,
        # P(sigma > 10) = 0.05
        prior.sigma = c(10, 0.05),
        # P(range < 0.1) = 0.05
        prior.range = c(0.1, 0.05))

Ok, but what do this sigma and range really represent?

Space = Matern

Under the hood INLA uses the Matern correlation / covariance function to estimate the spatial field.

Remember that our spatial models are usually:

\[ y(s_i) \sim D(\mu(s_i), ...) \\ \mu(s_i) = \beta X + u(s_i) \]

Space = Matern

Where \( s_i \) are the n locations where the data were sampled and u is a continuous spatial field that is commonly assumed to follow a multivariate normal distribution that can be parametrized as a gaussian process:

\[ u \sim GF(0, \Sigma) \]

The \( \Sigma \) matrix represent the correlation / covariance between the n observations, in other words if you have 100 locations, \( \Sigma \) has 100 rows and 100 columns.

Space = Matern

Fortunately we do not have to individually estimate all of these parameters, we can use the Matern function to compute the correlation / covariance between all observations given their (eucledian) distance following:

\[ cov(s_i, s_j) = Matern(\|s_i - s_j\|, \sigma, \rho) \]

Where \( \sigma \) is the variance in the spatial effect and \( \rho \) the range.

I want to know more about sigma and rho !!!

rho controls how fast the spatial effect varies, smaller values represent faster changes, larger values slower changes. In other words smaller rho values leads to small scale variations.

plot of chunk unnamed-chunk-15

I want to know more about sigma and rho !!!

sigma control the amplitude of the spatial effect, how high are the peaks and how low are the troughs.

plot of chunk unnamed-chunk-16

Fitting point pattern model in inlabru

mesh <- inla.mesh.2d(loc = data, max.edge = 0.1)
spde <- inla.spde2.pcmatern(mesh,
        prior.sigma = c(10, 0.05),
        prior.range = c(0.1, 0.05))
formula <- coordinates ~ Intercept + 
  spat(coordinates, model = spde)
mod <- lgcp(components = formula,
            data = spdf,
            samplers = spol)

Where spdf is a SpatialPointDataFrame of the observations and spol is a a SpatialPolygonDataFrame of the domain boundary.

Fitting point pattern model in inlabru

drawing

Extracting fitted parameter

range <- spde.posterior(mod, "spat", "range")
sigma <- spde.posterior(mod, "spat", "variance")

drawing

Extracting Matern correlation function

matern <- spde.posterior(mod, "spat", "matern.correlation")

drawing

Model checks with INLA/inlabru

  • Generally possibility to run posterior predictive checks or similar with INLA/inlabru models
  • BUT for point-pattern models we are actually modelling the location of the observations
  • as such we do not have a y response as such but rather a 2D (or 3D) point

Model checks with INLA/inlabru

We can however simulate new locations from the fitted model parameters:

drawing

Model selection with INLA/inlabru

There is also the possibility to compare models using two different information criteria: DIC and WAIC.

inlabru::deltaIC(fit1, fit2)
  Model       DIC Delta.DIC
1  fit1 -162.2933  0.000000
2  fit2 -161.4929  0.800458